fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output (#1444)
`gc.scan_hash_references` and `gc.scan_schema_references` were calling
`table.to_arrays(attr_name)`, which routes through `Expression.to_dicts`
(`expression.py:899`) and runs `decode_attribute` (`codecs.py:518`) for
each value. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`,
`<npy@>`, `<object@>`) this means downloading the bytes from external
storage and deserializing them into the codec's runtime type — a
`numpy.ndarray`, an `NpyRef`, an `ObjectRef`, raw bytes, or a local
file-path string — none of which satisfy `_extract_*_refs`'s
`isinstance(value, dict) and "path" in value` check.
Both helpers therefore silently returned empty reference sets. Every
populated schema reported `hash_referenced: 0` and
`schema_paths_referenced: 0`, so every external file looked orphaned to
`scan()` and a subsequent `collect()` would have deleted live data. The
broad `try/except Exception` around the loop never fired because no
exception was raised — `_extract_*_refs` returns `[]` for unrecognized
shapes by design.
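A minimal sketch of the mismatch (the helper below is illustrative, not the real `_extract_*_refs`): the decoded payload is never a dict, so the check falls through to the empty-list default and no exception is raised.

```python
# Illustrative sketch of the shape check, not DataJoint's actual code:
# only a dict carrying a "path" key counts as a reference.
def extract_refs(value):
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []  # unrecognized shapes yield nothing; no exception is raised

# What decode_attribute hands back for an <attach@> column: raw bytes.
decoded = b"payload bytes pulled from external storage"
# What the raw JSON metadata column actually stores (illustrative path):
stored = {"path": "blobs/ab/cd/abcd1234", "store": "local"}

print(extract_refs(decoded))  # -> []: the live reference is silently lost
print(extract_refs(stored))   # -> ['blobs/ab/cd/abcd1234']
```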
The intended design (per `reference/specs/type-system.md`) is for GC to
operate on the raw stored JSON metadata, not the decoded payload.
Replace `table.to_arrays(attr_name)` with
`table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor
yields the JSON column value directly: a Python dict on PostgreSQL
(JSONB auto-decoded) or a JSON string on MySQL. `_extract_*_refs`
already handles both shapes (`gc.py:138` string branch, `gc.py:145` dict
branch), so this is backend-agnostic with no adapter dispatch.
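The two cursor shapes can be sketched as follows (a simplified stand-in for the two branches in `gc.py`, not the exact source):

```python
import json

# Simplified stand-in for the two gc.py branches (not the exact source):
def refs_from_json_column(value):
    # MySQL branch: the JSON column arrives as a string; parse it first.
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return []
    # PostgreSQL branch: JSONB arrives as an already-decoded dict.
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []

meta = {"path": "npy/12/34/1234abcd.npy"}
assert refs_from_json_column(meta) == ["npy/12/34/1234abcd.npy"]              # PostgreSQL shape
assert refs_from_json_column(json.dumps(meta)) == ["npy/12/34/1234abcd.npy"]  # MySQL shape
```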
Side effect — `scan` is now a metadata-only operation. Previously it
downloaded every external blob just to deserialize and discard the
result via the silent type mismatch; on a 1 TB schema that meant 1 TB
of egress to produce `referenced: 0`. After this change, scan touches
only the JSON column on the database.
Custom-codec authors are unaffected: reference discovery operates on
the raw stored metadata regardless of what the codec's `decode()`
returns, so third-party codecs following the documented
`encode`/`decode` contract get correct GC for free.
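As an illustration of why that holds, consider a hypothetical third-party codec (every name in this sketch is invented and is not DataJoint API): the contract only requires `encode()` to return JSON-serializable metadata containing the storage path, while `decode()` may return any runtime type at all, because reference discovery never calls it.

```python
# Hypothetical third-party codec; all names here are invented for the
# sketch and are not part of DataJoint's API.
class WaveformCodec:
    def encode(self, value, store):
        path = store.put(value)                      # upload the payload
        return {"path": path, "codec": "waveform"}   # JSON metadata GC will see

    def decode(self, metadata, store):
        # May return any runtime type at all; GC never calls this.
        return memoryview(store.get(metadata["path"]))

class FakeStore:
    """In-memory stand-in for an external object store."""
    def __init__(self):
        self.files = {}
    def put(self, payload):
        path = f"waveform/{len(self.files)}.bin"
        self.files[path] = payload
        return path
    def get(self, path):
        return self.files[path]

store = FakeStore()
meta = WaveformCodec().encode(b"\x00\x01\x02", store)
# Reference discovery needs only the stored metadata dict:
assert isinstance(meta, dict) and "path" in meta
```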
Tests
-----
The existing `tests/integration/test_gc.py` mocks `scan_hash_references`,
`scan_schema_references`, and `list_*_paths` directly, so the production
code path through `to_arrays` → `decode_attribute` was never exercised
end-to-end. The mocked tests stay (they cover orchestration: composition
with `list_*_paths`, dry-run vs real, stat-key shape, format strings).
Add a `TestScanWithLiveData` class with three non-mocked end-to-end
tests, one per structurally distinct decoded-value type:
- `test_scan_finds_active_blob_reference` — `<blob@>` (decode → ndarray)
- `test_scan_finds_active_npy_reference` — `<npy@>` (decode → NpyRef)
- `test_scan_finds_active_object_reference` — `<object@>` (decode → ObjectRef)
Each declares a small manual table, inserts one row, and asserts
`scan(schema, store_name='local')` reports the expected `*_referenced`
count > 0. Verified to fail on the pre-fix code:
`{'hash_referenced': 0, 'hash_stored': 1, 'hash_orphaned': 1, ...}`.
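The assertion each live-data test makes can be sketched against the stat keys quoted above (the helper name and the exact stats dicts are illustrative, not copied from the test file):

```python
# Illustrative only: the stat keys come from the failure output quoted
# above; the helper name is invented for this sketch.
def assert_live_data_visible(stats):
    assert stats["hash_referenced"] > 0, "scan must see the inserted row"
    assert stats["hash_orphaned"] == stats["hash_stored"] - stats["hash_referenced"]

# Post-fix shape: the one inserted row is counted as referenced.
assert_live_data_visible({"hash_referenced": 1, "hash_stored": 1, "hash_orphaned": 0})

# Pre-fix shape from the PR body fails the first assertion:
buggy = {"hash_referenced": 0, "hash_stored": 1, "hash_orphaned": 1}
try:
    assert_live_data_visible(buggy)
except AssertionError:
    print("pre-fix stats rejected, as expected")
```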
Adjacent
--------
Register `gc` in `_lazy_modules` (`src/datajoint/__init__.py`). The
`gc.py` module docstring and the user docs at
`how-to/garbage-collection.md` both invoke GC as `dj.gc.scan(...)`,
which previously raised `AttributeError` because `gc` wasn't lazily
exposed at the package level. Pattern matches the existing
`"diagram": (".diagram", None)` entry.
Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU
window between scan and delete during which a concurrent transaction
could insert a row referencing what looks like an orphan. A two-phase
retrieval/removal API (quarantine → grace window → purge) is the right
remedy and will be tracked as a separate enhancement issue.
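A hedged sketch of what such a two-phase API could look like (entirely hypothetical; none of these names exist in DataJoint):

```python
# Hypothetical two-phase GC: quarantine -> grace window -> purge.
# Every name here is invented; this is a design sketch, not an API.
import time

class TwoPhaseGC:
    def __init__(self, grace_seconds=86400):
        self.grace_seconds = grace_seconds
        self.quarantine = {}  # path -> timestamp when first seen as orphaned

    def quarantine_orphans(self, orphans, now=None):
        now = time.time() if now is None else now
        for path in orphans:
            self.quarantine.setdefault(path, now)

    def purge_expired(self, still_referenced, now=None):
        """Delete only paths that stayed orphaned for the full grace window;
        a path re-referenced by a concurrent insert is released unharmed."""
        now = time.time() if now is None else now
        purged = []
        for path, seen_at in list(self.quarantine.items()):
            if path in still_referenced:          # revived by a concurrent insert
                del self.quarantine[path]
            elif now - seen_at >= self.grace_seconds:
                del self.quarantine[path]
                purged.append(path)
        return purged
```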
Fixes datajoint#1442
dimitri-yatsenko left a comment:
Approving — the fix is correct, the test scaffold proves the regression, and the framing is right.
Fix shape. table.proj(attr).cursor(as_dict=True) bypasses decode_attribute and yields the raw JSON column value — dict on PostgreSQL/JSONB, string on MySQL. Both shapes already handled by _extract_*_refs (gc.py:138 string branch, gc.py:145 dict branch). Custom-codec authors keep the documented encode/decode contract; reference discovery now works regardless of what decode() returns. As a free byproduct, scan becomes metadata-only — no more downloading every external blob to discard the deserialized result.
Test gap closed honestly. The pre-existing tests at test_gc.py:163-166 patched the scan helpers and list_*_paths, so the production path was never exercised. TestScanWithLiveData fills that with three e2e tests against real schemas with live external storage, one per structurally distinct decoded-value type (<blob@> → ndarray, <npy@> → NpyRef, <object@> → ObjectRef). Per the PR body, the same three tests fail on the buggy version — that's the right way to prove a regression test.
Lazy module registration of gc in __init__.py is a clean addition matching the existing diagram pattern and unblocks dj.gc.scan(...) as the docs already invoke it.
Out-of-scope TOCTOU is honestly flagged. The two-phase quarantine→grace-window→purge enhancement is the right next step; please file the follow-up issue once this lands.
Two minor notes:
- `<attach@>` (decodes to a local-path `str`) and `<hash@>` aren't in the e2e suite. Not blocking — the cursor approach is decoded-type-agnostic, so the three tests already prove the mechanism. Worth one line in the PR body acknowledging the scope choice.
- Strip the 🤖 trailer and any `Co-Authored-By` line at squash-merge time, per project convention.
This is a real data-loss recovery for users with <blob@>/<npy@>/<object@> columns and GC active. Land it.
Summary
Fixes #1442.
`gc.scan_hash_references` and `gc.scan_schema_references` were calling `table.to_arrays(attr_name)`, which routes through `Expression.to_dicts` → `decode_attribute` → the codec's `decode()`. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`, `<npy@>`, `<object@>`) that returns the deserialized payload — `numpy.ndarray`, `NpyRef`, `ObjectRef`, `bytes`, or a local file-path `str`. None of those satisfy `_extract_*_refs`'s `isinstance(value, dict) and "path" in value` check, so both helpers silently returned empty reference sets and `collect()` would have classified live data as orphaned.

This PR replaces the call with `table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor yields the raw JSON column value: a Python dict on PostgreSQL (JSONB auto-decoded) or a JSON string on MySQL — both already handled by `_extract_*_refs` (`gc.py:138` string branch, `gc.py:145` dict branch). Backend-agnostic, custom-codec-safe, and turns scan into a metadata-only operation (no more downloading every external blob just to discard the deserialized result).

Also registers `gc` in `_lazy_modules` (`src/datajoint/__init__.py`) so `dj.gc.scan(...)` works as both the module docstring and the user docs at `how-to/garbage-collection.md` invoke it. Pattern matches the existing `"diagram": (".diagram", None)` entry.

Test plan
---------
- `pytest tests/integration/test_gc.py -k "not TestScanWithLiveData"` — 26 existing mocked tests pass unchanged (orchestration coverage stays intact).
- `pytest tests/integration/test_gc.py::TestScanWithLiveData -v` — 3 new non-mocked e2e tests pass with the fix:
  - `test_scan_finds_active_blob_reference` — `<blob@>` (decode → `numpy.ndarray`)
  - `test_scan_finds_active_npy_reference` — `<npy@>` (decode → `NpyRef`)
  - `test_scan_finds_active_object_reference` — `<object@>` (decode → `ObjectRef`)
- The same three tests fail when `gc.py` is reverted to the buggy version. The failure mode reproduces the issue's exact symptom — `hash_referenced: 0`, `schema_paths_referenced: 0`, every file classified as orphaned.
- `pre-commit run` (ruff, ruff-format, codespell) clean.

Test gap closed
---------------
`tests/integration/test_gc.py:163-166` patches `scan_hash_references`, `scan_schema_references`, and `list_*_paths` directly, so the production path through `to_arrays` → `decode_attribute` was never exercised end-to-end. That's why this slipped through. The mocked tests stay (orchestration: composition with `list_*_paths`, dry-run vs real, stat-key shape, format strings). The new `TestScanWithLiveData` class fills the gap with one e2e test per structurally distinct decoded-value type.

Performance side-effect (free byproduct)
----------------------------------------
Before: scan downloaded every external blob just to deserialize and discard it via the silent type mismatch — a 1 TB schema produced 1 TB of egress to report `referenced: 0`. After: scan touches only the JSON column on the database during reference enumeration.

Custom-codec authors
--------------------
Reference discovery now operates on raw stored metadata regardless of what a codec's `decode()` returns. Third-party codecs following the documented `encode`/`decode` contract get correct GC for free — no contract change required.

Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU window between scan and delete during which a concurrent transaction could insert a row referencing what looks like an orphan. A two-phase retrieval/removal API (quarantine → grace window → purge) is the right remedy and will be filed as a separate enhancement issue right after this PR opens.